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We discuss a unified approach to stochastic optimization of pseudo-Boolean 
objective functions based on particle methods, including the cross-entropy 
method and simulated annealing as special cases. We point out the need for 
auxiliary sampling distributions, that is parametric families on binary spaces, 
which are able to reproduce complex dependency structures, and illustrate 
their usefulness in our numerical experiments. We provide numerical evidence 
that particle-driven optimization algorithms based on parametric families 
yield superior results on strongly multi-modal optimization problems while 
local search heuristics outperform them on easier problems. 
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1 Particle optimization 

1.1 Introduction 

1.1.1 Pseudo-Boolean optimization 

We call f : M d := {0, l} d — > K a pseudo-Boolean function. The present work discusses 
approaches to obtain heuristics for the program 

maximize fix) , , 

w (1) 
subject to x e W 

using sequential Monte Carlo techniques. In the sequel, we refer to / as the objective 
function. For an excellent overview of applications of binary programming and equivalent 
problems we refer to the survey paper by Boros and Hammer (2002) and references 
therein. 

The idea to use particle filters for global optimization is not new [(Del Moral et al., 
2006), Section 2.3. l.c], but novel sequential Monte Carlo methodology making use of 
suitable parametric families on binary spaces Schafer and Chopin (2011) may allow to 
construct more efficient samplers for the special case of pseudo-Boolean optimization. 
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We particularly discuss how this methodology connects with the cross-entropy method 
Rubinstein (1997) which is another particle driven optimization algorithm based on 
parametric families. 

The sequential Monte Carlo algorithm as developed by Schafer and Chopin (2011) is 
rather complex compared to local search algorithms such as simulated annealing Kirk- 
patrick et al. (1983) or k-opt local search Merz and Preisleben (2002) which can be 
implemented in a few lines. The aim of this paper is to motivate the use of advanced 
particle methods and sophisticated parametric families in the context of pseudo-Boolean 
optimization and to provide conclusive numerical evidence that these complicated al- 
gorithms can indeed outperform simple heuristics if the objective function has poorly 
connected strong local maxima. This is not at all clear, since, in terms of computational 
time, multiple randomized restarts of fast local search heuristics might very well be more 
efficient than comparatively complex particle approaches. 

1.1.2 Outline 

The article is structured as follows. We first introduce some notation and review how 
to model an optimization problem (1) as a filtering problem on an auxiliary sequence 
of probability distributions. Section 2 describes a sequential Monte Carlo sampler Del 
Moral et al. (2006) designed for global optimization on binary spaces Schafer and Chopin 
(2011). Section 3 reviews three parametric families for sampling multivariate binary 
data which can be incorporated in the proposed class of particle algorithms. Section 
4 discusses how the cross-entropy method Rubinstein (1997) and simulated annealing 
Kirkpatrick et al. (1983) can be interpreted as special cases of the sequential Monte Carlo 
sampler. In Section 5 we carry out numerical experiments on instances of the uncon- 
strained quadratic binary optimization problem. First, we investigate the performance 
of the proposed parametric families in particle-driven optimization algorithms. Secondly, 
we compare variants of the sequential Monte Carlo algorithm, the cross-entropy method, 
simulated annealing and simple multiple-restart local search to analyze their respective 
efficiency in the presence or absence of strong local maxima. 

1.1.3 Notation 

We briefly introduce some notation that might be non-standard. We denote scalars in 
italic type, vectors in italic bold type and matrices in straight bold type. Given a set 
M, we write \M\ for the number of its elements and 1m for its indicator function. For 
a, b £ Z we denote by [a, 6] = {a, . . . , b} the discrete interval from a to b. Given a vector 
x G M d and an index set / C [1, dj, we write xj £ B^' for the sub- vector indexed by I 
and x_i E B rf_ ^l for its complement. We occasionally use the norms ||£c||oo • — max^x^ 
and |sc| := Ya=i \ x i\- 

1.2 Statistical modeling 

1.2.1 Associated probability measures 

For particle optimization, the common approach is defining a family of probability mea- 
sures {71-g: q > 0} associated to the optimization problem max xeV> d f(x) in the sense 
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that 



lim TTp = U Mf , 



where Us denotes the uniform distribution on the set 5 and Mf = a,Tgm&x x£K d f(x) 
the set of maximizers. The idea behind this approach is to first sample from a simple 
distribution, potentially learn about the characteristics of the associated family and 
smoothly move towards distributions with more mass concentrated in the maxima. We 
review two well-known techniques to explicitly construct such a family ir e . 

Tempered family We call {vr e : q > 0} a tempered family, if it has probability mass 
functions of the form 

ir e (-y) := i/ e exp(g/(7)), (2) 
where u^ 1 := E 7 eB<* exp(ff/(7))- 

As q increases, the modes of ir g become more accentuated until, in the limit, all mass 
is concentrated on the set of maximizers. The name reflects the physical interpretation 
of Kq(x) as the probability of a configuration x £ M d for an inverse temperature g and 
energy function — /. This is the sequence used in simulated annealing Kirkpatrick et al. 
(1983). 

Level set family We call {tt s : g > 0} a level set family, if it has probability mass 
functions of the form 



where L+ := {7 € 



^(7) :=|L+ri L +( 7 ), 
- /(7)] < 1} for x* G Mf. 



(3) 



Indeed, is the super-level set of / with respect to the level c = f(x*) — l/g,forg>0, 
and 7T g (-f) is the uniform distribution on L+. As g increases, the support of 7r e becomes 
restricted to the points that have an objective value sufficiently close to the maximum 
of the /. In the limit, the support is reduced to the set of global maximizers. 

Figure 1: Associated sequences ir et for a toy example /: B 3 — > [—20,20]. The colors 
indicate the advance of the sequences from yellow to red. For simplicity, we 
choose Qt = t for t £ [0, 16]. 
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(a) objective function f(x) 



(b) tempered sequence (2) 



(c) level set sequence (3) 



The particle-driven optimization algorithms are computationally more involved than 
local search heuristics since we need to construct a sequence of distributions instead 
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of a sequence of states. We shall see that this effort pays off in strongly multi- modal 
scenarios, where even sophisticated local search heuristics can get trapped in a subset 
of the state space. 

1.2.2 Rare event simulation 

While the tempered sequence is based on a physical intuition, the level set sequence 
has an immediate interpretation as a sequence of rare events since, as g increases, the 
super-level set becomes a 'rare event' with respect to the uniform measure. Rare event 
simulation and global optimization are therefore closely related concepts and methods 
for rare event estimation can often be adapted to serve as optimization algorithms. 

Particle algorithms for rare event simulation include the cross-entropy method Rubin- 
stein (1997) and the sequential Monte Carlo sampler Johansen et al. (2006). The former 
uses the level set sequence, the latter uses a logistic potential family 

7r e ( 7 ) :=*/ ff W(7) -/(**)]), 

where u~ x := E^b* WW " /(**)]) and £: M -> (0,1), £(x) = [1 + exp(-x)]- 1 
denotes the logistic function. Johansen et al. (2006) did not specifically design their 
algorithm for optimization but their approach to static rare event simulation is closely 
related to the particle optimization framework. 

2 Sequential Monte Carlo 

We discuss a static sequential Monte Carlo sampler that uses a transition kernel with 
independent proposals in the move step based on suitable parametric families. This 
methodology has been demonstrated to reliably estimate the mean of the posterior dis- 
tribution in Bayesian variable selection problems for linear normal models Schafer and 
Chopin (2011). In this section, we provide a self-contained description of this framework. 
For a more general overview of sequential Monte Carlo methods we refer to Del Moral 
et al. (2006). 

2.1 Sequential Importance Sampling 

For convenience of notation, we index the sequence of distributions {i^ gt )t directly by t 
rather that by the parameter of the family Qt. We refer to X = (xi, . . . , x n ) J G ~M> nxd and 
w G [0, l] n with \w\ = 1 as a particle system with n particles. We say the particle system 
(w,X.) targets the probability distribution ir if the empirical distribution Ylk=i w kdx k 
converges to ir for n — > oo. 

2.1.1 Importance weights 

We sample (wq,X.q) with x\ t Q, . . . ,a3 n ,o ~ = and set wq = 1/n to initialize the 
system. Suppose we are given a particle approximation (wt, Xt) of nt and want to target 
the subsequent distribution 7r i+ i. For all k G [1, n] and a > 0, we update the weights to 

Uk,t,a , ^g t +a( x k,t) / j \ 

Wk,t, a ■= j where u k ,t, a ■= Wkj—— — , (4) 

l^i=lUi,t,a ir et {Xk,t) 

and 7T g oc n e denotes the unnormalized version of tt q . The normalizing constants v Q and 
defined in equations (2) and (3) are unknown but the algorithm only requires ratios 
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of unnormalized probability mass functions. We refer to a as the step length at time t. 
After updating, the particle system targets the distribution 

^Qt+a ~ Efc=l w k,t,a S Xttk ■ (5) 

As we choose ot larger, that is Tfg t +a further from ir Qt , the weights become more un- 
even and the accuracy of the importance approximation deteriorates. If we repeat the 
weighting step, we just increase a and finally obtain an importance sampling estimate 
of -Koo = Um, with instrumental distribution ttq = U^d. This yields a poor estimator 
since the probability to hit the set Mf with n uniform draws is 1 — (1 — 2~ d \M y|) n and 
decreases rapidly as the dimension d grows. The pivotal idea behind sequential Monte 
Carlo is to alternate moderate updates of the importance weights and improvements of 
the particle system via resampling and Markov transitions. 

Effective sample size The importance weight degeneracy is often measured through 
the effective sample size criterion Kong et al. (1994) defined as 

riniw) := [n^Li w l] ~* € [1/n, 1]. 

The effective sample size is 1 if the weights are uniform, that is equal to 1/n; the effective 
sample size is 1/n if all mass is concentrated in a single particle. 

2.1.2 Finding the step length 

Given any increasing sequence (Tr gt )t, we could repeatedly reweight and monitor whether 
the effective sample size falls below a critical threshold. For the special case of annealing 
via sequential Monte Carlo, however, the effective sample size after weighting r] n {wt^ a ) 
is merely a function of a. For a particle system (Xt, wf) at time t, we pick a step length 
a such that 

rj n (wt) ft = T] n (Wt l a), (6) 

that is we lower the effective sample with respect to the current particle approximation 
by some fixed ratio (3 G (0,1) [(Jasra et al., 2011), (Del Moral et al., 2011)]. This 
ensures a 'smooth' transition between two auxiliary distributions, in the sense that 
consecutive distributions are close enough to approximate each other reasonably well 
using importance weights; in our numerical experiments, we took (3 = 9/10. We obtain 
the associated sequence (gt)t by setting g t +i = Qt+ott where at is a unique solution of (6). 
Since rj n {wt^ a ) is continuous and monotonously decreasing in a we can use bi-sectional 
search to solve (6). 

2.2 Conditioning the particle system 

In this section, we discuss how to condition the weighted system and improve its quality 
before we proceed with the next weighting step. 

2.2.1 Resampling 

We replace the system (w t +i,X. t ) targeting irt+i by a selection of particles sci, ... ,x n 
drawn from the current particle reservoir x\ t, . . . , x n t such that 

IE (ra(acfc)) = nw k , 
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where n(x) denotes the number of particles identical with x. Thus, in the resampled 
system particles with small weights have vanished while particles with large weights 
have been multiplied. For the implementation of the resampling step, there exist sev- 
eral recipes. We could apply a multinomial resampling Gordon et al. (1993) which is 
straightforward. There are, however, more efficient ways like residual Liu and Chen 
(1998), stratified Kitagawa (1996) and systematic resampling Carpenter et al. (1999). 
We use the latest in our simulations, see Procedure 1. 



Procedure 1: Resampling (systematic) 

[ht] Input: w = (wi, . . . ,w n ), X= (x 1; . . . , a;„) T 
d <- nio, i <— 1, c <— V\ 
sample u ~ Wro.i] 
for k £ [l,n] do 

while c < u do i 4— i + 1, c c + Vi 

Xk <— Xi, u <— u + 1 
end 

return X = [x\ . . . , x„) J 



2.2.2 Moving the system 

If we repeated the weighting and resampling steps several times, we would rapidly reduce 
the number of different particles to a very few. The key to fighting the depletion of 
the particle reservoir is moving the particles according to a Markov transition kernel 
Kt+i with invariant measure ftt+i- The particle x^\ +1 is by construction approximately 
distributed according to TTt+i, and a draw 

x k,t+i ~ I x k,t+i) 

is therefore again approximately distributed according to Ttt+i- The last sample of the 
generated Markov chain (x^ t+1 , . . . , x^\ +1 ) is, for sufficiently many move steps s € N, 
almost exactly distributed according to the invariant measure itt+i and independent of 
its starting point. 

2.2.3 Stopping rule 

While we could always apply a fixed number of move steps, we rather use an adaptive 
stopping criterion based on the number of distinct particles. 

Particle diversity We define the particle diversity as 

Cn(X) -n-^ixk-. ke [l,n]}\ G [l/n,l]. 

Ideally, the sample diversity Cn(X) should correspond to the expected diversity 

Cn(vr) := 1 A TT 1 J] 7 eB d ^{x& d : c n ir(x)>l}h)^ 

where c n is the smallest value that solves X^eB^ L c "- 7r (0')J — n - This is the particle 
diversity we would expect if we had an independent sample from n(x). Therefore, if 
Kt+i is fast-mixing, we want to move the system until 

C„(xW)«Cnfa+l)- 
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Since the quantity on the right hand side is unknown, we stop moving the system as 
soon as the particle diversity reaches a steady state we cannot push it beyond Schafer 
and Chopin (2011). 

More precisely, we stop if the absolute diversity is above a certain threshold ~ 0.95 
or the last improvement of the diversity is below a certain threshold £ A > 0. We always 
stop after a finite number of steps but the thresholds (* and need to be calibrated 
to the efficiency of the transition kernel. For slow-mixing kernels, we recommend to 
perform batches of consecutive move steps instead of single move steps. 

If the average acceptance rate A of the kernel (see Section 2.2.4) is smaller than it 
is likely that the algorithm stops after the first iteration although further moves would 
have been necessary. We could adaptively adjust the threshold £a to be proportional to 
an estimate of the average acceptance rate; for our numerical experiments, however, we 
kept it fixed to ( A ~ 10~ 2 . 



Procedure 2: Move 

Input- X = ^ ' ' ' ' ' " ° ) T tar § etin g n 

«(7 I •) with Al) = ExeB 7r ( a; ) K (7 I x) 

s <- 1 
repeat 

sample ccl ~ t(« | x^~ X ') for all k e 
until C(X (s) ) - C(X ( ^ 1} ) < C A or C(X (s) ) > C* 
return X (s) = (x[ s} . . . , x { n s) y 



2.2.4 Transition kernels 

Most transition kernels in Monte Carlo simulations are some variant of the Metropolis- 
Hastings kernel (see e.g. Robert and Casella (2004)), 

Kt+i (7 I x) := \ qt+1 (-y,x)q t+1 (~f \ x) + 5 x (j) 1 - ^ v eM^ K+i (tf> x)q t +i(y \ x 



where we sample from the kernel by proposing a new state 7 ~ Qt+i{l \ x ) an d accepting 
the proposal with probability 

W7,») :=1A §±1^±^ (7) 
+ 7r t+ i(sr)gt+i(7 | x) 

or returning x otherwise. Again, we denote by itt+i °c %t+l the unnormalized version 
of 7Tt+i since the kernel only requires the ratio of the unnormalized probability mass 
functions. 

Symmetric kernel On binary spaces, a common choice for the proposal distribution 
is 

9(7 I x) = TL lPk S k (\x -l\)k\(d- k)\/d\, (8) 
with weight vector p E [0, l] d normalized such that \p\ = 1. 

With probability p k , the kernel proposes a uniform draw from the ^-neighborhood of x, 

N k (x) := {7 eJB d : \x - ~f\ = k}. (9) 



7 



We refer to this type of kernel as symmetric kernel since q(~y \ x) = q(x | 7) and equation 
(7) simplifies. This class of kernels provide a higher mutation rate than the random-scan 
Gibbs kernel (see Schafer and Chopin (2011) for adiscussion) . 

Locally operating transition kernels of the symmetric type are known to be slowly 
mixing. If we put most weight on small values of k, the kernel only changes one or a 
few entries in each step. If we put more weight on larger values of k, the proposals will 
hardly ever be accepted if the invariant distribution ir is multi-modal. Ideally, we want 
the particles sampled from the transition kernel to be nearly independent after a few 
move steps which is often hard to achieve using local transition kernels. 

Adaptive independent kernel For the sequential Monte Carlo algorithm, we use 
adaptive independent kernels which have proposal distributions of the kind 

9(7 I x) = q e (j), 9 G 9, 

which do not depend on the current state x but have a parameter 9 which we adapt 
during the course of the algorithm. 

The adaptive independent kernel is rapidly mixing if we can fit the parametric family 
qe such that the proposal distribution q t +\ = qe t+1 is sufficiently close to the target 
distribution wt+i, yielding thus, on average, high acceptance rates X qt+1 . The general 
idea behind this approach is to take the information gathered in the current particle 
approximation into account (see e.g. Chopin (2002)). The usefulness of this strategy for 
sampling on binary spaces has been illustrated by Schafer and Chopin (2011). 

We fit a parameter 9t+i to the particle approximation of Ttt+i according to some 
suitable criterion. Precisely, 9t+i is taken to be the maximum likelihood or method 
of moments estimator applied to the weighted sample (w i+ i,X t ). The choice of the 
parametric family qe is crucial to the implementation of a sequential Monte Carlo sampler 
with adaptive independent kernel. We discuss this issue in detail in Section 3. 

Adaptation could, to a certain extent, also be done for local transition kernels. Nott 
and Kohn (2005) propose an adaptive kernel which replaces the full conditional dis- 
tribution of the Gibbs sampler by an easy to compute linear approximation which is 
estimated from the sampled particles. This method accelerates Gibbs sampling if the 
target distribution n is hard to evaluate but does not provide fast mixing like the adap- 
tive independent kernel (see Schafer and Chopin (2011) for a comparison). 

Still, the use of local kernels in the context of the proposed sequential Monte Carlo 
algorithm might be favorable if, for instance, the structure of the problem allows to 
rapidly compute the acceptance probabilities of local moves. Further, batches of lo- 
cal moves can be alternated with independent proposals to ensure that the algorithm 
explores the neighborhood of local modes sufficiently well. 

2.3 Remark on discrete state spaces 

Since the sample space B rf is discrete, a given particle is not necessarily unique. This 
raises the question whether it is sensible to store multiple copies of the same weighted 
particle in our system. In the sequel, we discuss some more details concerning this issue 
which has only been touched upon briefly by Schafer and Chopin (2011). 

Let n(x) denote the number of copies of the particle x in the system (w,X). Indeed, 
for parsimonious reasons, we could just keep a single representative of x and aggregate 
the associated weights to w*(x) = n(x) w(x). 
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2.3.1 Impact on the effective sample size 

Shifting weights between identical particles does not affect the nature of the particle 
approximation but it obviously changes the effective sample size rj n (w) which is undesir- 
able since we introduced the effective sample size as a criterion to measure the goodness 
of a particle approximation. From an aggregated particle system, we cannot distinguish 
the weight disparity induced by reweighting according to the importance function (4) 
and the weight disparity induced by multiple sampling of the same states which occurs 
if the mass of the target distribution is concentrated. More precisely, we cannot tell 
whether the effective sample size is actually due to the gap between Tr t and itt+i or the 
presence of particle copies due to the mass of nt concentrating on a small proportion of 
the state space which occurs by construction of the auxiliary distribution in Section 1.2. 

2.3.2 Impact on the resample-move step 

Aggregating the weights means that the number of particles is not fixed at runtime. 
In this case, the straightforward way to implement the move step presented in Section 
2.2.2 is breaking up the particles into multiple copies corresponding to their weights and 
moving them separately. But instead of permanently splitting and pooling the weights 
it seems more efficient to just keep the multiple copies. 

We could, however, design a different kind of resample-move algorithm which first 
augments the number of particles in the move step and then resamples exactly n weighted 
particles from this extended system using a variant of the resampling procedure proposed 
by Fearnhead and Clifford (2003). A simple way to augment the number of particles is 
sampling and reweighting via 

(!) I I (0)\ (!) \ (0) f-i \\ 

x k ~ ?t+l(* I x k )' W k = W k = ~ A )> 

where A = X qt+1 (x^\ x£ ) denotes the acceptance probability (7) of the Metropolis- 
Hastings kernel. We tested this variant but could not see any advantage over the stan- 
dard sampler presented in the preceding sections. For the augment-resample type algo- 
rithm the implementation is more involved and the computational burden significantly 
higher. In particular, the Rao-Blackwellization effect one might achieve when replacing 
the accept-reject steps of the transition kernel by a single resampling step does not seem 
to justify the extra computational effort. 

Indeed, aggregating the weights does not only prevent us from using the effective 
sample size criterion, but also requires extra computational time of O(nlogn) in each 
iteration of the move step since pooling the weights is as complex as sorting. With our 
application in mind, however, computational time is more critical than memory, and we 
therefore recommend to refrain from aggregating the weights. 

3 Parametric families on binary spaces 

We review three parametric families on M d . In contrast to the similar discussion in 
Schafer and Chopin (2011), we also consider a parametric family which cannot be used 
in sequential Monte Carlo samplers but in the context of the cross-entropy method. For 
more details on parametric families on binary spaces we refer to Schafer (2012). 
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3.1 Suitable parametric families 

We frame some properties making a parametric family suitable as proposal distribution 
in sequential Monte Carlo algorithms. 

(a) For reasons of parsimony, we prefer a family of distributions with at most d(d + l) /2 
parameters like the multivariate normal. 

(b) Given a sample X = (x\, . . . , x n ) J from the target distribution tt, we need to estimate 
6* in a reasonable amount of computational time. 

(c) We need to generate samples Y = (yi, . . . ,y m ) J from the family qg. We need the 
rows of Y to be independent. 

(d) For the sequential Monte Carlo algorithm, we need to evaluate qg(y) point- wise. 
However, the cross-entropy method still works without this requirement. 

(e) We want the calibrated family qg* to reproduce e.g. the marginals and covariance 
structure of tt to ensure that the parametric family qg* is sufficiently close to tt. 

3.2 Product family 

The simplest non-trivial distributions on M d are certainly those having independent 
components. 

Product family For a vector m E (0, l) d of marginal probabilities, we define the prod- 
uct family 

Q^-UUmTa-mi) 1 ^. (10) 

3.2.1 Properties 

We check the requirement list from Section 3.1: (a) The product family is parsimonious 
with dim(#) = d. (b) The maximum likelihood estimator rh is the weighted sample 
mean, (c) We can easily sample y ~ q^. (d) We can easily evaluate the mass function 
Qm(v)- ( e ) However, the product family does not reproduce any dependencies we might 
observe in (to, X). 

The last point is the crucial weakness which makes the product family impractical 
for particle optimization algorithms on strongly multi-modal problems. Consequently, 
the rest of this section deals with ideas on how to sample binary vectors with a given 
dependence structure. There are, to our knowledge, two major strategies to this end. 

(1) We construct a generalized linear model which permits to compute the conditional 
distributions. We apply the chain rule and write qg as 

Veil) = qehi) nf= 2 Qofri I 7l:i-i) ; (11) 
which allows to sample the entries of a random vector component-wise. 

(2) We sample from an auxiliary distribution ipg and map the samples into B rf . We call 

qe{l) = f T -i h) M v )dv (12) 
a copula family, although we refrain from working with explicit uniform marginals. 
We first present a generalized linear model and then review a copula approach. 
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3.3 Logistic conditionals family 



Even for rather simple non-linear models we usually cannot derive closed-form expres- 
sions for the marginal probabilities required for sampling according to (11). Therefore, 
we might directly construct a parametric family from its conditional probabilities. 

Logistic conditionals family We define, for a lower triangular matrix A G M. dxd , the 
logistic conditionals family as 



d(7) := E, 



e[MT 



an + 2^ 



j=l a ijlj 



l-£[au + Y,)=i a ij1j 



where I: K — )■ (0, 1), t[x) = [1 + exp(— x)] 1 is the logistic function. We readily identify 
the product family as the special case A = diag^ _1 (m). 



The virtue of the logistic conditionals family is that, by construction, we can sample 
a random vector component-wise while the full probability q A {y) of the sample y is 
computed as a by-product of Procedure 3. We refer to the Online Supplement for 
instructions on how to fit the parameter A. 



Procedure 3: Sampling via chain rule factorization 

y = (0,...,0), p~Tl 
for i € [1, dj do 

r <- qi(Vi = 1 I Vl:i-l) = K a ii + Sj=l a ijV]) 

u~W[0,l] 

if u < r then y L 1 




p-r if yi = 1 

P ■ (1 - r) if y l = 



end 

return y, p 



3.3.1 Properties 

We check the requirement list from Section 3.1: (a) The logistic conditionals family is 
sufficiently parsimonious with dim(#) = d(d + l)/2. (b) We can fit the parameter A via 
likelihood maximization. The fitting is computationally intensive but feasible, (c) We 
can sample y ~ using the chain rule factorization (11). (d) We can exactly evaluate 
Qa(v)- ( e ) T ne family q^ reproduces the dependency structure of the data X although 
we cannot explicitly compute the marginal probabilities. 

3.4 Gaussian copula family 

Let ifg be a family of multivariate auxiliary distributions on X and r : X — > M d a mapping 
into the binary space. We can sample from the copula family (12) by setting x = t[v) 
for a draw v ~ ipo from the auxiliary distribution. Most multivariate parametric families 
with at most d{d + l)/2 parameters appear to either have a rather limited dependency 
range or they do not scale to higher dimensions Joe (1996). Therefore, the natural and 
seemingly only viable option for (pg is the multivariate normal distribution Emrich and 
Piedmonte (1991). 
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Gaussian copula family For a vector a G ]R d and a correlation matrix S £ M dxd , we 
introduce the mapping 

r a : R d ^B d , r a (v) := (1^^), . . . ,t(-oo,a d ](v d )), 
and define the Gaussian copula family as 

<s(7) := (27r)-3detE-3 J T -i h) exp (-± rTS" 1 !*) dv. 
For index sets / C [1, dj, the cross-moments 

m i = E 7eB d <s(7) rii 6 / 7< 

are equal the cumulative distribution function of the multivariate normal with respect to 
the entries indexed by I (see Schafer (2012) for a more detailed discussion). In particular, 
the first and second moments are 

rm = ^i(ai), rriij = ^ 2 {ai,a j ;a ij ), i,j £ [l,d], 

where ^i(-) and <p2(-, S denote the cumulative distribution functions of the univari- 
ate and bivariate normal distributions with zero mean, unit variance and correlation 
coefficient ct^- G [—1,1]. We refer to the Online Supplement for instructions on how to 
fit the parameters a and XI. 



3.4.1 Properties 

We check the requirement list from Section 3.1: (a) The Gaussian copula family is 
sufficiently parsimonious with dim(#) = d(d+ l)/2. (b) We can fit the parameters a and 
5] via method of moments. However, the parameter XI is not always positive definite, (c) 
We can sample y ~ gg c s using y = r a (v) with v ~ <^£. (d) We cannot easily evaluate 
gg c s (y) since this requires computing high-dimensional integral expressions which is a 
computationally challenging problem in itself (see e.g. Genz and Bretz (2009)). The 
Gaussian copula family is therefore less useful for sequential Monte Carlo samplers but 
can be incorporated into the cross-entropy method reviewed in Section 4.2. (e) The 
family qg c E reproduces the exact mean and, possibly scaled, correlation structure. 

3.5 Toy example 

We briefly discuss a toy example to illustrate the usefulness of the parametric families. 
For the quadratic function 



f(x) = x J Fx, F := 



/I 2 


1 o\ 


2 1 


-3 -2 


1 -3 


1 2 


\0 -2 


2 -2/ 



(13) 



the associated probability mass function Tr(y) oc exp(7 T F7) has a correlation matrix 



R 



/ 1 0.127 -0.106 -0.101\ 

0.127 1 -0.941 -0.866 

-0.106 -0.941 1 0.84 

V-0.101 -0.866 0.84 1 / 
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which indicates that this distribution has considerable dependencies and its mass func- 
tion is therefore strongly multi-modal. We generate pseudo-random data from w, adjust 
the parametric families to the data and plot the mass functions of the fitted parametric 
families. 

Figure 2 shows how the three parametric families cope with reproducing the true mass 
function. Clearly, the product family is not close enough to the true mass function to 
yield a suitable instrumental distribution while the logistic conditional family almost 
copies the characteristics of tt and the Gaussian copula family allows for an intermediate 
goodness of fit. 

Figure 2: Toy example showing how well the parametric families replicate the mass func- 
tion of the distribution 7r(7) oc exp(7 T F7) as defined in (13). 
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(c) Logistic conditionals family 9a (7) 
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(d) Gaussian copula family q a ,s{-f) 



4 Optimization algorithms 

In this section, we provide a synopsis of all steps involved in the sequential Monte 
Carlo algorithm and connect this framework to the cross-entropy method and simulated 
annealing. In Table 1, we state the necessary formulas for the tempered and the level 
set sequence introduced in Section 1.2. 

4.1 Sequential Monte Carlo 

For convenience, we summarize the complete sequential Monte Carlo sampler in Algo- 
rithm 1. Note that, in practice, the sequence 7r ft is not indexed by t but rather by Qt, 
which means that the counter t is only given implicitly. 

The algorithm terminates if the particle diversity sharply drops below some threshold 

5 > which indicates that the mass has concentrated in a single mode. If we use a 
kernel with proposals from a parametric family qg t , we might already stop if the family 
degenerates in the sense that only a few components of qg t , say less than d* = 12, 
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are random while the others are constant ones or zeros. In this situation, additional 
moves using a parametric family are a pointless effort. We either return the maximizer 
within the particle system or we solve the subproblem of dimension d* by brute force 
enumeration. We might also perform some final local moves in order to further explore 
the regions of the state space the particles concentrated on. 

Algorithm 1: Sequential Monte Carlo optimization 

Input: /: B d -> K 

sample ~ U^d for all k 6 [l,n]. 

a <— find step length(0,X), 

w <— importance weights (a, ir, X) 

while Cn(X) > S do 

qe fit parametric family^, X) (see Section 3) 

X <— resample(u;, X) (Procedure 1) 

X <s— move(/tj ige , X) (Procedure 2) 

a 4— find step length(g, X) 

w <— importance weight s( a, Tr g , X) 

q <— g + a 

end 

return argmax xe{a;i Xn} f(x) 



4.2 Cross-entropy method 

For the level set sequence, the effective sample size is the fraction of the particles which 
have an objective function value greater than max a . gB d f(x) — l/(gt + a); see Table 1 
and equation (3). The remaining particles are discarded since their weights equal zero. 
Consequently, there is no need to explicitly compute at as a solution of (6). We simply 
order the particles according to their objective values f(xk) and only keep the n(l— /3) 
particles with the highest objective values. 

Rubinstein Rubinstein (1997), who popularizes the use of level set sequences in the 
context of the cross-entropy method, refers to n(l — f3) as the size of the elite sample. 
The cross-entropy method has been applied successfully to a variety of combinatorial 
optimization problems, some of which are equivalent to pseudo-Boolean optimization 
Rubinstein and Kroese (2004), and is closely related to the proposed sequential Monte 
Carlo framework. 

However, the central difference between the cross-entropy method and the sequential 
Monte Carlo algorithm outlined above is the use of the invariant transition kernel in the 
latter. We obtain the cross-entropy method as a special case if we replace the kernel Kt 
by its proposal distribution qg t . The sequential Monte Carlo approach uses a smooth 
family of distributions {7^: g > 0} and explicitly schedules the evolution ir gt which in 
turn leads to the proposal distributions qg t . The cross-entropy method, in contrast, 
defines the subsequent proposal distribution 

Qe t+1 ~ ^1 L + +1 

without any reference sequence nt to balance the speed of the particle evolution. 
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Table 1: Formulas for optimization sequences 



exp(gf) 



Ut, a (x k ,t) 

A gt+1 (7 I x k>t ) 



1 A 



"ELi e2a/(a!M) 

e a(/(T)-/(«Bfc,t)) 



,logg t (-y)-loggt(aj fc]t ) 



|{a!fc it | fcG [i,nl}nL 



1 



1 A 



n 



Jog gt(-y)-log qt(x k ,t) 



In order to decelerate the advancement of the cross-entropy method, we introduce a 
lag parameter r 6 [0, 1) and use a convex combination of the previous parameter Bt—\ 
and the parameter 9t fit to the current particle system, setting 

t :=(l-r)0 t + T0 t _i. 

However, there are no guidelines on how to adjust the lag parameter during the run 
of the algorithm. Therefore, the sequential Monte Carlo algorithm is easier to calibrate 
since the reference sequence nt controls the stride and automatically prevents the system 
from overshooting. 

On the upside, the cross-entropy method allows for a broader class of auxiliary distri- 
butions q$ t since we do not need to evaluate qe t (x) point- wise which is necessary in the 
computation of the acceptance probability of the Hastings kernel; see Section 3.4. 



4.3 Simulated annealing 

A well-studied approach to pseudo-Boolean optimization is simulated annealing Kirk- 
patrick et al. (1983). While the name stems from the analogy to the annealing process 
in metallurgy, there is a pure statistical meaning to this setup. We can picture simulated 
annealing as approximating the mode of a tempered sequence (2) using a single particle. 
Since a single observation does not allow for fitting a parametric family, we have to rely 
on symmetric transition kernels (8) in the move step. 

A crucial choice is the sequence Qt which in this context is often referred to as the 
cooling schedule. There is a vast literature advising on how to calibrate Qt where a typical 
guideline is the expected acceptance rate of the Hastings kernel. We calibrate Qt such 
that the empirical acceptance rate 

h-s-.t ■= Y?r=t-s -V, s > 

follows approximately (t + l) -5 for t £ [0, 1]. There are variants of simulated annealing 
which use more complex cooling schedules, tabu lists and multiple restarts, but we stick 
to this simple version for the sake of simplicity. Algorithm 2 describes the version we 
use in our numerical experiments in Section 5.3.3. 
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Algorithm 2: Simulated annealing optimization 



[ht] Input: /: B d ->• E, T* E R 
a; ~ W B d, a;* a;, i 0, Ta (time elapsed) 
while T A < T* do 
sample 

7 ~ Wjv^b), u ~ W[ ,i], A t •<- 1 A exp [g t (f(j) - f(x))] 

if u < X t then a; •<— 7 

if /(a;) > /(cc*) then x* <- x 

adjust Q t such that A t _ s:t « (l + T A /T*)- 5 

t<-t+l 
end 

return a;* 



4.4 Randomized local search 

We describe a greedy local search algorithm which works on any state space that allows 
for defining a neighborhood structure. The typical neighborhood on binary spaces is the 
^-neighborhood defined in (9). A greedy local search algorithm computes the objective 
value of all states in the current neighborhood and moves to the best state found until 
a local optimum is reached. The local search algorithm is called /c-opt if it searches the 
neighborhood U^ =1 Ni(-) (see e.g. Merz and Freisleben (2002) for a discussion). 

The algorithm can be randomized by repeatedly restarting the procedure from ran- 
domly drawn starting points. There are more sophisticated versions of local search 
algorithms exploit the properties of the objective function but even a simple local search 
procedure can produce good results Alidaee et al. (2010). Algorithm 3 describes the 
1-opt local search procedure we use in our numerical experiments in Section 5.3.3. 



Algorithm 3: Randomized local search 

[ht] Input: /: M d -> K, T* G K 
x* ~ U B d, Ta 4- (time elapsed) 
while T A < T* do 

x — U V d 

while x is not a local optimum do 

x^argmax yeJVl ( a! )/(7) 
end 

if f(x) > f(x*) then x* x 
end 

return x* 
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5 Applications 



5.1 Unconstrained Quadratic Binary Optimization 

5.1.1 Introduction 

It is well-known that any pseudo-Boolean function f : M d — > R can be written as a 
multi-linear function 

f( X )= j2 /(i/(i),...,i/(d))n^ n (!-^)= e ^n^ ( i4 ) 

jc[i,d] ie/ ie[i,d]\/ /c[i,d] ie/ 

where aj G 1 are real- valued coefficients. We say the function / is of order k if the 
coefficients aj are zero for all / C [l,<i] with |7| > k. While optimizing a first order 
function is trivial, optimizing a non-convex second order function is already an NP-hard 
problem Garey and Johnson (1979). 

In the sequel, we focus on optimization of second order pseudo-Boolean functions to 
exemplify the stochastic optimization schemes discussed in the preceding sections. If / 
is a second order function, we restate program (1) as 

maximize x J Fx 

a 15) 
subject to x £ B , 

where F 6 M dxd is a symmetric matrix. We call (15) an unconstrained quadratic binary 
optimization problem (uqbo); we refer to Boros et al. (2007) for a list of applications and 
equivalent problems. In the literature it is also referred to as unconstrained quadratic 
Boolean or bivalent or zero-one programming Beasley (1998). 

5.1.2 Particle optimization and meta-heuristics 

Meta-heuristics are a class of algorithms that optimize a problem by improving a set 
of candidate solutions without systematically enumerating the state space; typically 
they deliver solutions in polynomial time while an exact solution has exponential worst 
case running time. The outcome is neither guaranteed to be optimal nor deterministic 
since most meta-heuristics are randomized algorithms. We briefly discuss the connection 
to particle optimization against the backdrop of the unconstrained quadratic binary 
optimization problem where we roughly separate them into two classes: local search 
algorithms and particle-driven meta-heuristics. 

Local search algorithms iteratively improve the current candidate solution through 
local search heuristics and judicious exploration of the current neighborhood; examples 
are local search Boros et al. (2007); Merz and Freisleben (2002), tabu search Glover et al. 
(1998); Palubeckis (2004), simulated annealing Katayama and Narihisa (2001). Particle 
driven meta-heuristics propagate a set of candidate solutions and improve it through 
recombination and local moves of the particles; examples are genetic algorithms Merz 
and Freisleben (1999), memetic algorithms Merz and Katayama (2004), scatter search 
Amini et al. (1999). For comparisons of these methods we refer to Hasan et al. (2000) 
or Beasley (1998). 

The sequential Monte Carlo algorithm and the cross-entropy method are clearly in the 
latter class of particle-driven meta-heuristics. The idea behind sequential Monte Carlo 
is closely related to the intuition behind population (or swarm) optimization and genetic 
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(or evolutionary) algorithms. However, the mathematical framework used in sequential 
Monte Carlo allows for a general formulation of the statistical properties of the particle 
evolution while genetic algorithms are often problem-specific and empirically motivated. 

5.1.3 Particle optimization and exact solvers 

If we can explicitly derive the multi-linear representation (14) of the objective function, 
there are techniques to turn program (1) into a linear program. For the UQBO it reads 

d i— 1 d 

maximize fix) = 2 ^ ^ fij^ij + 

i=l j=l i=l 

subject to x £ JB d ( d+1 )/ 2 (16) 
Xy < Xjj > for all i, j 6 

Xij ^ X%% -\- Xjj 1 J 

Note, however, that there are more parsimonious linearization strategies than this straight- 
forward approach [(Hansen and Meyer, 2009), (Gueye and Michelon, 2009)]. The trans- 
formed problem allows to access the tool box of linear integer programming which con- 
sist of branch-and-bound algorithms that are combined with rounding heuristics, var- 
ious relaxations techniques and cutting plane methods [(Pardalos and Rodgers, 1990), 
(Palubeckis, 1995)]. 

Naturally, the question arises whether particle-driven meta-heuristics can be incor- 
porated into exact solvers to improve branch-and-bound algorithms. Indeed, stochastic 
meta-heuristics deliver lower bounds for maximization problems, but particle-driven al- 
gorithms are computationally somewhat expensive for this purpose unless the objective 
function is strongly multi-modal and other heuristics fail to provide good results; see the 
discussion in Section 5.2.5. 

However, the sequential Monte Carlo approach in combination with the level set se- 
quence (3) might also be useful to determine a global branching strategy, since the 
algorithm provides an estimator for 

ic ■■= i^r 1 E 76 B^7i L +(7), 

which is the average of the super-level set := {x G B d : f(x) > c}. These estimates 
given for a sequence of levels c might provide branching strategies than are superior to 
local heuristics or branching rules based on fractional solutions. A further discussion of 
this topic is beyond the scope of this paper but it certainly merits consideration. 

5.2 Construction of test problems 
5.2.1 Introduction 

The meta-heuristics we want to compare do not exploit the quadratic structure of the 
objective function and might therefore be applied to any binary optimization program. 
If the objective function can be written in multi-linear form like (15) there are efficient 
local search algorithms Boros et al. (2007); Merz and Freisleben (2002) which exploit 
special properties of the target function and easily beat particle methods in terms of 
computational time. 
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Therefore, the use of particle methods is particularly interesting if the objective func- 
tion is expensive to compute or even a black box. The posterior distribution in Bayesian 
variable selection for linear normal models is an example of such an objective function 
(see Schafer and Chopin (2011) and references therein). We stick to the UQBO for our 
numerical comparison since problem instances of varying difficulty are easy to generate 
and interpret while the results carry over to general binary optimization. 

In the vast literature on UQBO, authors typically compare the performance of meta- 
heuristics on a suite of randomly generated problems with certain properties. Pardalos 
(1991) proposes standardized performance tests on symmetric matrices F 6 "Z dxd with 
entries fij drawn from the uniform 

Qc(k) := 2^[-c,c](*0, ceN - 

The test suites generated by Beasley (1990, OR-library) and Glover et al. (1998) follow 
this approach have been widely used as benchmark problems in the UQBO literature (see 
Boros et al. (2007) for an overview). In the sequel we discuss the impact of diagonal 
dominance, shifts, the density and extreme values of F on the expected difficulty of the 
corresponding UQBO problem. 



5.2.2 Diagonal 

Generally, stronger diagonal dominance in F corresponds to easier UQBO problems Bil- 
lionnet and Sutter (1994). Consequently, the original problem generator presented by 
Pardalos (1991) is designed to draw the off-diagonal elements from a uniform on a dif- 
ferent support {—q, q} with ggN. 

In this context, we point out that the impact of diagonal dominance carries over to 
the statistical properties of the tempered distributions (2) we defined in the introduc- 
tory Section 1.2. Indeed, stronger diagonal dominance in F corresponds to exponential 
quadratic distributions 

= exp(7TF7) 
HJ> ' E 7£Bd exp( 7 TF 7 ) 

having lower dependencies between the components of *y. We can analytically derive a 
parameter A £ M. dxd for a logistic conditionals family q^ that approximates vr('y) where 
the quality of the approximation increases as the diagonal of F becomes more dominant 
Schafer (2012). We can accelerate the sequential Monte Carlo algorithm by initializing 
the system from q^ instead of U^d. However, we did not exploit this option to keep the 
present work more concise. 

For positive definite F y 0, the optimization problem is convex and can be solved 
in polynomial time Kozlov et al. (1979); in exact optimization, this fact is exploited to 
construct upper bounds for maximization problems Poljak and Wolkowicz (1995). We 
observe a corresponding complexity reduction in statistical modeling. For F y 0, the 
auxiliary distribution 

M = 7 T F7 

2 d - 2 (lTFl+tr(F))' 

is a feasible mass function, and we can derive analytical expressions concerning all cross- 
moments and marginal distributions Schafer (2010) which allows to largely analyze the 
properties of vr(7) without enumerating the state space. 
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5.2.3 Shifts 

The global optimum of the UQBO problem is more difficult to detect as we shift the 
entries of the matrix F but the relative gap between the optimum and any heuristic 
value diminishes. If we sample fij = f[j from a uniform on the shifted support 

q c , T (k) :=^[_ c+T;C+T ](/c), cGN, r G [-c,c], 

we obtain an objective function 



where = means equality in distribution. Hence, with growing |r| the optimum depends 
less on F and the relative gap between the optimum and a solution provided by any 
meta-heuristic vanishes. Boros et al. (2007) define a related criterion 



and report a significant impact of p on the solution quality of their local search algorithms 
which is not surprising. 

5.2.4 Density 

The difficulty of the optimization problem is related to the number of interactions, that 
is the number of non-zero elements of F. We call the proportion of non-zeros the density 
of F. Drawing fij from the mixture 



we adjust the difficulty of the problem to a given expected density to. 

Note that not all algorithms are equally sensitive to the density of F. Using the basic 
linearization (16), each non-zero off-diagonal element requires the introduction of an 
auxiliary variable and three constraints. Thus, the expected total number of variables 
and the expected total number of constraints, which largely determine the complexity 
of the optimization problem, are proportional to the density co. 

On the other hand, many randomized approaches, including the particle algorithms 
discussed in Section 2, are less sensitive to the density of the problem in the sense that 
replacing zero elements by small values has a minor impact on the performance of these 
algorithms. Rather than the zero/non-zero duality, we suggest that the presence of 
extreme values determines the difficulty of providing heuristic solutions. 

5.2.5 Extreme values 

The uniform sampling approach advocated by Pardalos (1991) is widely used in the 
literature for comparing meta-heuristics. Certainly, particle-driven methods are com- 
putationally too expensive to outperform local search heuristics on test problems with 
uniformly drawn entries; Beasley (1998) confirms this intuition with respect to genetic 
algorithms versus tabu search and simulated annealing. However, the uniform distribu- 
tion does not produce extreme values and it is vital to keep in mind that these have an 
enormous impact on the performance of local search algorithms. 



f T (x) = x J F T x = x J (F° + t1V)x = fo(x) +t\x 



d 




q c ,uj(k) = wW[_ C)C ](fe) + (1 - u)6 (k), c 



G N, co G (0, 1] 
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Extreme values in F lead to the existence of distinct local maxima x* 6 IB of / 
in the sense that there is no better candidate solution than x* in the neighborhood 
U* =1 iVj(£C*) even for relatively large k. Further, extreme local minima might completely 
prevent a local search heuristic from traversing the state space in certain directions. 
Consequently, local search algorithms, as discussed in Section 5.1.2, depend more heavily 
on their starting value, and their performance deteriorates with respect to particle-driven 
algorithms. 

We propose to draw the matrix entries from a discretized Cauchy distribution 

C c (»oc (l + (k/c) 2 )-\ cGN (17) 

that has heavy tails which cause extreme values to be frequently sampled. Figure 3 shows 
the distribution of a Cauchy and a uniform to illustrate the difference. The resulting 
UQBO problems have quite distinct local maxima; in that case we also say that the 
function fix) is strongly multi-modal. 



Figure 3: Histograms of a Cauchy C5 and a uniform U\q distribution. 
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5.3 Numerical comparison 

In this section, we provide numerical comparisons based on instances of the UQBO prob- 
lem. We generated two random test suites of dimension d = 250, each having 10 in- 
stances. For the first suite, we sampled the matrix entries from a uniform distribution 
Uioo on [—100,100]; for the second, we sampled from a Cauchy distribution C100 as 
defined in (17). For performance evaluation, we run a specified algorithm 100 times on 
the same problem and denote the outcome by x\ a; mo- 

5.3.1 Visualization 

Since the absolute values are not meaningful, we report the relative ratios 

f(xk) — worst solution found ^_ ^ 

best known solution — worst solution found ' ' 

where the best known solution is the highest objective value ever found for that instance 
and the worst solution is the lowest objective value among the 100 outcomes. We sum- 
marize the results in a histogram. The first n bins are singletons bk := {f?/j} for the 
highest values q* > ■ ■ ■ > g* € k E [1, 100]}; the following n bins are equidistant 
intervals := pi=pg£, n ~^ +1 g* ). The graphs show the bins b\, . . . , b n , bf, . . . , b< in 
descending order from left to right on the x-axis. The interval bins are marked with a 
sign "<" and the lower bound. The y-axis represents the counts. 

For comparison, we draw the outcome of several algorithms into the same histogram, 
where the worst solution found is the lowest overall objective value among the outcomes. 
For each algorithm, the counts are depicted in a different color and, for better readability, 
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with diagonal stripes in a different angle. To put it plainly, an algorithm performs well 
if its boxes are on the left of the graph since this implies that the outcomes where often 
close to the best known solution. 



5.3.2 Comparison of binary parametric families 

We study how the choice of the binary parametric family affects the quality of the 
delivered solutions. The focus is on the cross-entropy method, since we cannot easily 
use the Gaussian copula family in the context of sequential Monte Carlo. We use n = 
1.2 x 10 4 particles, set the speed parameter to /3 = 0.8 (or the elite fraction to 0.2) and 
the lag parameter to r = 0.5. 

The numerical comparisons, given in Figures 5(b) and 5(a), clearly suggest that using 
more advanced binary parametric families allows the cross-entropy method to detect 
local maxima that are superior to those detected using the product family. Hence, the 
numerical experiments confirm the intuition of our toy example in Figure 2. 

On the strongly multi-modal instance 5(a) the numerical evidence for this conjecture 
is stunningly clear-cut; on the weakly multi-modal problem 5(b) its validity is still un- 
questionable. This result seems natural since reproducing the dependencies induced by 
the objective function is more relevant in the former case than in the latter. 



Figure 4: The cross-entropy method using different binary parametric families. 
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5.3.3 Comparison of optimization algorithms 

We compare a sequential Monte Carlo sampler with parametric family, a sequential 
Monte Carlo sampler with a single- flip symmetric kernel (8), the cross-entropy method, 
simulated annealing and 1-opt local search as described in Section 4. 

For the cross entropy method, we use the same parameters as in the preceding sec- 
tion. For the sequential Monte Carlo algorithm, we use n = 0.8 x 10 4 particles and 
set the speed parameter to f3 = 0.9; we target a tempered auxiliary sequence (2). For 
both algorithms we use the logistic conditionals family as sampling distribution. With 
these configurations, the algorithms converge in roughly 25 minutes. We calibrate the 
sequential Monte Carlo sampler with local moves to have the same average run time by 
processing batches of 10 local moves before checking the particle diversity criterion. The 
simulated annealing and 1-opt local search algorithms run for exactly 25 minutes. 

The results shown in Figures 6(b) and 6(a) assert the intuition that particle methods 
perform significantly better on strongly multi-modal problems. However, on the easy test 
problems, the particle methods tend to persistently converge to the same sub-optimal 
local modes. This effect is probably due to their poor local exploration properties. Since 
particle methods perform significantly less evaluations of the objective function, they 
are less likely to discover the highest peak in a region of rather flat local modes. The use 
of parametric families aggravates this effect, and it seems advisable to alternate global 
and local moves to make a particle algorithm more robust against this kind of behavior. 

Further numerical results are shown in Figure 6 and Figure 7. 

6 Discussion and conclusion 

The numerical experiments carried out on different parametric families revealed that the 
use of the advanced families proposed in this paper significantly improves the perfor- 
mance of the particle algorithms, especially on the strongly multi-modal problems. The 
experiments demonstrate that local search algorithms, like simulated annealing and ran- 
domized 1-opt local search, indeed outperform particle methods on weakly multi-modal 
problems but deliver inferior results on strongly multi-modal problems. 

Using tabu lists, adaptive restarts and rounding heuristics, we can certainly design 
local search algorithms that perform better than simulated annealing and 1-opt local 
search. Still, the structural problem of strong multi-modality persists for path-based 
algorithms. On the other hand, cleverly designed local search heuristics will clearly beat 
sequential Monte Carlo methods on easy to moderately difficult problems. 

The results encourage the use of particle methods if the objective function is known 
to be potentially multi-modal and hard to analyze analytically. We have to keep in 
mind that multiple restarts of rather simple local search heuristics can be very efficient 
if they make use of the structure of the objective function. For 25 minutes of randomized 
restarts, the heuristic proposed by Boros et al. (2007), which exploits the fact that the 
partial derivatives of a multi-linear function are constant, practically always returns the 
best known solution on all test problems treated to create Figures 6 and 7. 

The numerical work was completely done in Python 2.6 using SciPy packages and run 
on a cluster with 1.86 GHz processors. The sources used in this work and the problems 
processed in this paper can be found at http://code.google.eom/p/smcdss. 
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Figure 5: Comparison of stochastic optimization algorithms on two UQBO problems. 
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8 Appendix: Fitting the parameters 

We briefly summarize how the parameters of the logistic conditionals family and the 
Gaussian copula family can be assessed for a given particle system X = (sci, . . . ,cc n ) T 
with weights w = (w\, . . . , w n ). We denote by 

Xi ■=Y2=i w k x ki, Xij :=J2k=i w k x kiXkj, i,je{l,dl (18) 
the weighted first and second sample moments. 
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Figure 6: Comparison of stochastic optimization algorithms. 10 problems f(x) = x J Fx 
with fij ~ Cioo for i,j G [1, 250] 
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Figure 7: Comparison of stochastic optimization algorithms. 10 problems f{x) 
with fij ~ U 100 for i,j € [1, 2501 
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8.1 Logistic conditionals family 

8.1.1 Derivatives of the log-likelihood function 

The log- likelihood function of the weighted logistic regression of := X.j on := 
(X.i :i _i, 1) is 

n 

logL(a) = J> fc \y®]og[£(z®a)] + (1 - yf) log[l - £(z«a)] 



k=l 



w k y$z%a - log[l + exp(4l ) a 



fc=i 



where we used that log[l — ^(a; T a)] = — log[l + exp(a; T a)] = — x J a + log[£(x J a)]. Since 
91og[l + exp(x J a)] / da = £(x J a)x, the gradient of the log-likelihood is 



(«) = £ 



(Z»)Tdiag(^)[y«-pk i) ] 



fc=i 

where (Po^)fc := ^( z i» a )- Since d£(x J a)/da = — £(x J a)[l — £(x J a)]x, the Hessian matrix 
of the log-likelihood is 

n 

s'(a) = -j>* [t(z®a)[l-i(z®a)]] *j£(*£) T = -(Z«)Tdiag( W )diag(q«)zW, 
fc=i 



where (q®) k := ^o)[l - *(z£a)]. 



8.1.2 Complete separation 

The data might suffer from complete or quasi-complete separation Albert and Anderson 
(1984) which causes the likelihood function L(a) to be monotonic. In that case there is 
no maximizer. We can avoid the monotonicity by assigning a suitable prior distribution 
to the parameter a. Firth (1993) recommends the Jeffreys prior for its bias reduction 
which can conveniently be implemented via data adjustment Kosmidis and Firth (2009). 

For the sake of simplicity, however, we only assign a simple Gaussian prior with 
variance e^ 1 > such that, up to a constant, the log-posterior distribution is the log- 
likelihood function plus a quadratic penalty term and therefore always convex. The score 
function and its Jacobian matrix become 

s(a) = (Z«) T diag(u;)[yW - ea, s'(a) = -(Z«) T diag(w)diag(qi i) )Z« - el. 

The bias-reduced estimators are known to shrink towards 0.5 which is an undesired 
property when fitting a parametric family. Therefore, we attempt to keep the shrinkage 
parameter e as small as possible. 

8.1.3 Newton-Raphson iterations 

The first order condition s{a) = is solved iteratively 

(*+D = a (t) _ [ s '( (*))]-i s ( a (*)) = (*) + X V) 
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where x^ is the vector that solves 



(Z«)Tdiag(«;)diag(^ ( ) t) )(Z«)T + el 



x 



(t) 



(Z«)Tdiag(ti>)[y 



(0 



v {i) 1 



ea 



(i) 



If the Newton iteration at the ith component fails to converge, we can either augment 
the penalty term e which leads to stronger shrinkage of the mean m« towards 0.5 or 
we can drop some covariates for j G {1, . . . ,i — 1} from the iteration to improve the 
numerical condition of the procedure. 

In practice, we drop the predictors from the regression model which are only weakly 
correlated with the explained variable. This step is important to speed up the algorithm 
and improve its numerical properties. For a proposal distribution, it is particularly 
important to take the strong dependencies into account but it is often sufficient to work 
with very sparse logistic conditionals families. 

In particularly difficult cases, we might prefer to set an = £ (xi) and a^i^-i = 0, 
where Xi is defined in (18), which guarantees that at least the mean is correct. This is 
an important issue since misspecification of the mean of ji also affects the distribution 
of the components jj for j £ {i + 1, . . . , d} which are sampled conditional on 7,. 



Algorithm 4: Fitting the weighted logistic regressions 



Input: X = (asi, . . . ,cc„) T , w = (wi, . . .,W n ), 
for i e [1, d] do 

Z <- (X. 1:i _i, 1), y <- X.i, a(°) <- A itl:i 
repeat 

p k ^l(Z k .aW) for all ke [l,n] 
qk «- Pfc(l -Pk) for all he [l,n] 



A 6 



pt/x d 



,.(*+!) 



(Z (i) )TdiagHdiag[q]Z (i) + el 
(Z( i} )Tdiag[w;] [y-p]-ea^ 



until ||a( t+1 ) -aW||oo < 6 

Ai,i:j a 
end 

return A 



8.2 Gaussian copula family 

We adjust the Gaussian copula family g£ c s by method of moments. We need to solve 
the non-linear equations 

#i(aj)=Si, & 2 (ai,aj;aij) = Xij, [1,4 (19) 

where Xi and Xjj are defined in (18) while &\(vi) and ^2(^5 Vj-; cr^) denote the cumulative 
distribution functions of the univariate and bivariate normal distributions with zero 
mean, unit variance and correlation coefficient er^ G [—1,1]. 

While the parameter a« = ^^~ 1 (xj) is easy to assess, the challenging task is to compute 
the bivariate variances Oij for all i,j G [1, d}. Recall the standard result [(Johnson et ah, 
2002), p.255] 

d<P 2 {xi,x 2 ;a) 

= (p{x l ,x 2 ; a), (20) 
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where <p(-, •; <r) denotes the density of the bivariate normal distribution with correlation 
coefficient a. We obtain the following Newton-Raphson iteration 

g (m) =ff («)_^^4 ) )-% > (21) 

starting at some initial value crf^ £ (—1,1); see Procedure 5. In the sequential Monte 
Carlo context, good initial values are obtained from the parameters of the previous 
auxiliary distributions. 

We use a fast series approximation Drezner and Wesolowsky (1990) to evaluate ^2(0*5 a j\ a ij) 
These approximations are critical when Gij comes very close to either boundary of [—1, 1]. 
The Newton iteration might repeatedly fail when restarted at the corresponding bound- 
ary 6 { — 1,1}. In any event, ^2(^1 ; x 2] cr) is strictly monotonic in a since its 
derivative (20) is positive, and we can switch to bi-sectional search if necessary. 

Algorithm 5: Fitting the dependency matrix 
Input: Xi, Xij for all i,j E 
a, t <- ^- x {xi) for all i e [1, d] 
E<°> <- I 

for i,j e [l,dj, i < j do 
repeat 

(T. 



(t + 1) (t) *2(Oi,aj!CTy) "Si. 



untill^-^K^ 

-a - -g +1) 

end 

if not S >- then S <- (S + |A|I)/(1 + |A|) 
return a, S 



The locally fitted correlation matrices 5] might not be positive definite for d > 3 
because the Gaussian copula is not flexible enough to model the full range of cross- 
moments binary distributions might have (see Schafer (2012) for an extended discussion). 
We obtain a feasible parameter replacing XI by 

£* = (E + |A|I)/(1 + |A|), 

where A is smaller than all eigenvalues of the locally fitted matrix S. This approach 
evenly lowers the local correlations to a feasible level and is easy to implement on stan- 
dard software. In practice, we also prefer to work with a sparse version of the Gaussian 
copula family that concentrates on strong dependencies and sets minor correlations to 
zero. 
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